Search Results: "Ian Wienand"

6 December 2009

Ian Wienand: Distance to 1 million people

On a recent trip up the Oregon coast, a friendly doorman at our hotel in Portland was inquiring about our trip. When we mentioned we passed through Bandon, OR, he quipped that Bandon was the place furthest from a city of one million people in the USA. I guess a normal person would just think "oh, that's interesting" and move on, but it has been plaguing me ever since. Firstly, I had to find what cities in the USA had more than 1 million people. Luckily Wolfram Alpha gives the answer:

Chicago,Illinois
Dallas,Texas
Houston,Texas
Los Angeles,California
New York, New York
Philadelphia,Pennsylvania
Phoenix,Arizona
San Antonio,Texas
San Diego,California

(I certainly wouldn't have guessed that list!) From there my plan was to find the bounding box of the continental USA; luckily Wikipedia has the raw data for that. Combined with the latitude and longitude of the cities above, I had the raw data. I couldn't figure out any way better than a simple brute-force of testing every degree and minute of latitude and longitude within the bounding box and calculating the distance to the closest large city; the theory being that from one particular point you would have to travel further than any other to reach a city of 1 million people. Luckily, that is short work for a modern processor, and hopefully the result would be a point somewhere around Bandon. I'd already become acquainted with the great circle and measuring distances when I did Tinymap, so a quick python program evolved. However, it turns out that the program picks the far south-east corner of the bounding box. Thanks to the shape of the USA, that is way out over the ocean somewhere. I can't figure out a way to get an outline of the USA to test if a given point is inside the border or not, but all is not lost. I modified the program to output the the distance to the closest large city along with the location to a log file, and then imported it into gnuplot to make a heat-map. The hardest part was finding an equirectangular outline of the USA to place the heat-map over, rather than a much more common Mercator projection; Wikimedia to the rescue! I actually surprised myself at how well the two lined up when, after a little work with Gimp, I overlayed them (big) Distance to a city of 1 million people (km)

Distance to a city of 1 million people (km)

From this, I can see that Bandon, about a third of the way up the Oregon coast, is a pretty good candidate. However, probably not the best; I get the feeling the real point that is the furthest from any city of 1 million people is actually somewhere in the central-middle of Montana. However, we can also fiddle the program slightly to disprove the point about Bandon. The numbers show the closest large city to Bandon is LA, at ~1141km. Taking another point we suspect to be more remote; the closest large city to Portland (where we met the doorman) is also LA at ~1329km. So to reach the closest large city you have to travel further from Portland than Bandon, so Bandon is not the furthest place in the USA from a city of one million people. Myth busted!

20 November 2009

Ian Wienand: Go L4!

By now everybody has now heard about Go, Google's expressive, concurrent, garbage collecting language. One big, glaring thing stuck out at me when I was reading the documentation:

Do not communicate by sharing memory; instead, share memory by communicating.

One of the examples given is a semaphore using a channel, which I'll copy here for posterity.

var sem = make(chan int, MaxOutstanding)
func handle(r *Request)  
    sem <- 1;    // Wait for active queue to drain.
    process(r);  // May take a long time.
    <-sem;       // Done; enable next request to run.
 
func Serve(queue chan *Request)  
    for  
        req := <-queue;
        go handle(req);  // Don't wait for handle to finish.

Here is a little illustration of that in operation. Semaphores with Google Go

Serve creates goroutines via the go keyword; each of which tries to get a slot in the channel. In the example there are only 3 slots, so it acts like a semaphore of count 3. When done, each thread returns its slot to the channel, which allows anyone blocked to be woken and continued. This instantly reminded me of the very first thing you need to do if you ever want to pass Advanced Operating Systems -- write a semaphore server to provide synchronisation within your OS. In L4, threads communicate with each other via inter-process communication (IPC). IPC messages have a fixed format - you specify a target thread, bundle some data into the available slots in the IPC format and fire it off. By default you block waiting for a reply -- this all happens within a single call for efficiency. On the other side, you can write servers who are listening for remote IPC connections, where everything happens in reverse. Here's another illustration the of the trivial semaphore server concept Shehjar and I implemented. L4 semaphore server example

Look familiar? Instead of a blocking push of a number into a slot into a channel, you make a blocking IPC call to a remote server. My point here is that both take the approach of sharing memory via communication. When using IPC, you bundle up all your information into the available slots in the IPC message and send it. When using a channel, you bundle your information into an entry in the channel and call your goroutine. Receiving the IPC is the same as draining a channel - both result in you getting the information that was bundled into it by the caller.

IPC	Go
Start thread	Start goroutine
New thread blocks listening for IPC message	Goroutine blocks draining empty channel
Bundle information into IPC message	Bundle data into type of your channel
Send IPC to new thread	Push data into channel
Remote thread unbundles IPC	goroutine drains channel and gets data

Whenever you mention the word "microkernel", people go off the deep-end and one thing they seem to forget about is the inherent advantages of sharing state only via communication. As soon as you do that, you've broken open an amazing new tool for concurrency, which is now implicitly implied. By communicating via messages/channels rather than shared global state, it doesn't matter where you run! One of those threads in the example could be running on another computer in your cloud, marshalling up it's IPC messages/channel entries and sending them over TCP/IP -- nobody would care! At any rate, do not communicate by sharing memory; instead, share memory by communicating is certainly an idea whose time has come.

11 September 2009

Ian Wienand: Django toolchain on Debian

Although Django is well packaged for Debian, I've recently come to the conculsion that the packages are really not what I want. The problem is that my server runs Debian stable, while my development laptop runs unstable, and Django revisions definitely fall into the "unstable" category. There really is no way to use a system Django 1.1 on one side, and a system Django 1.0 on the other. After a bit of work, I think I've got something together that works, and I post it here in the hope it is useful for someone else. This info has been gleaned from similar references such as this and this. This is aimed at running a server using Debian stable (5.0) for production and an unstable environment for development. You actually need both to get this running. This is based on a project called "project" that lives in /var/www

First step is to install python-virtualenv on both.
Create a virtualenv on both, using the --no-site-packages to make it a stand-alone environment. This is like a chroot for python.
```
$ virtualenv --no-site-packages project
New python executable in project/bin/python
Installing setuptools............done.
```
The unstable environment has a file you'll need to copy into the stable environment - bin/activate_this.py. The stable version of python-virtualenv isn't recent enough to include this file, but you need it to essentially switch the system python into the chrooted environment. This will come in handy later when setting up the webserver.
There are probably better ways to keep the two environments in sync, but I simply take a manual approach of doing everything twice, once in each. So from now on, do the following in both environments.

Activate the environment

/var/www$ cd project
/var/www/project$ . bin/activate
(project) /var/www/project$

Use easy_install to install pip

(project) /var/www/project$ easy_install pip
Searching for pip
Reading http://pypi.python.org/simple/pip/
Reading http://pip.openplans.org
Best match: pip 0.4
Downloading http://pypi.python.org/packages/source/p/pip/pip-0.4.tar.gz#md5=b45714d04f8fd38fe8e3d4c7600b91a2
Processing pip-0.4.tar.gz
Running pip-0.4/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Wu9O-U/pip-0.4/egg-dist-tmp-xjSdxq
warning: no previously-included files matching '*.txt' found under directory 'docs/_build'
no previously-included directories found matching 'docs/_build/_sources'
zip_safe flag not set; analyzing archive contents...
pip: module references __file__
Adding pip 0.4 to easy-install.pth file
Installing pip script to /var/www/project/bin
Installed /var/www/project/lib/python2.5/site-packages/pip-0.4-py2.5.egg
Processing dependencies for pip
Finished processing dependencies for pip

Install setuptools, also using easy_install (for some reason, pip can't install it). There is a trick here, you need to specify at least version 0.6c9 or there will be issues with the SVN version on Debian stable when you try to get Django in the next step.

(project) /var/www/project$ easy_install setuptools==0.6c9
Searching for setuptools==0.6c9
Reading http://pypi.python.org/simple/setuptools/
Best match: setuptools 0.6c9
Downloading http://pypi.python.org/packages/2.5/s/setuptools/setuptools-0.6c9-py2.5.egg#md5=fe67c3e5a17b12c0e7c541b7ea43a8e6
Processing setuptools-0.6c9-py2.5.egg
Moving setuptools-0.6c9-py2.5.egg to /var/www/project/lib/python2.5/site-packages
Removing setuptools 0.6c8 from easy-install.pth file
Adding setuptools 0.6c9 to easy-install.pth file
Installing easy_install script to /var/www/project/bin
Installing easy_install-2.5 script to /var/www/project/bin
Installed /var/www/project/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg
Processing dependencies for setuptools==0.6c9
Finished processing dependencies for setuptools==0.6c9

Create a requirements.txt with the path to the Django SVN for pip to install, then and then install it.

(project) /var/www/project$ cat requirements.txt
-e svn+http://code.djangoproject.com/svn/django/tags/releases/1.0.3/#egg=Django
(project) /var/www/project$ pip install -r requirements.txt
Obtaining Django from svn+http://code.djangoproject.com/svn/django/tags/releases/1.0.3/#egg=Django (from -r requirements.txt (line 1))
  Checking out http://code.djangoproject.com/svn/django/tags/releases/1.0.3/ to ./src/django
(project) /var/www/project$ pip install -r requirements.txt
Obtaining Django from svn+http://code.djangoproject.com/svn/django/tags/releases/1.0.3/#egg=Django (from -r requirements.txt (line 1))
  Checking out http://code.djangoproject.com/svn/django/tags/releases/1.0.3/ to ./src/django
... so on ...

Almost there! You can keep installing more Python requirements with pip if you need, but we've got enough here to start.
Create a file in /var/www/project called project-python.py. This will be the Python interpreter the webserver uses, and basically exec's itself into the virtalenv. The file should contain the following:
```
activate_this = "/var/www/project/bin/activate_this.py"
execfile(activate_this, dict(__file__=activate_this))
from django.core.handlers.modpython import handler
```
Now it's time to start the Django project. I like to create a new directory called project, which will be the parent directory kept in the SCM with all the code, media, templates, database (if using SQLite) etc. In this way to keep the two environments up-to-date I simply svn ci on one side, and svn co on the other.
```
(project) /var/www/project$ mkdir project
(project) /var/www/project/project$ mkdir db django media www
(project) /var/www/project/project$ cd django/
(project) /var/www/project/project/django$ django-admin startproject myproject
```

Last step now is to wire-up Apache to serve it all up. The magic is making sure you specify the correct PythonHandler that you made before to use the virtualenv, and include the right paths so you can find it and all the required Django settings.

DocumentRoot /var/www/project
<Location "/">
    SetHandler python-program
    PythonHandler project-python
    PythonPath "['/var/www/project/','/var/www/project/project/django/'] + sys.path"
    SetEnv DJANGO_SETTINGS_MODULE myproject.settings
    PythonDebug On
</Location>
Alias /media /var/www/project/project/media
<Location "/media">
    SetHandler none
</Location>
<Directory "/var/www/project/project/media">
    AllowOverride none
    Order allow,deny
    Allow from all
    Options FollowSymLinks Indexes
</Directory>

With all this, you should be up and running in a basic but stable environment. It's easy enough to update packages for security fixes, etc via pip after activating your virtualenv.

18 May 2009

Ian Wienand: Dig Jazz Applet, V2

It seems the ABC updated the DIG Jazz now-playing list format, breaking V1. Some quick flash disassembly and a bit of hacking, and order is restored. As a bonus, it now shows the upcoming songs. DIG Jazz now-playing Gnome applet

Source or Debian package.

7 May 2009

Ian Wienand: Quickly describing hash utilisation

I think the most correct way to describe utilisation of a hash-table is using chi-squared distributions and hypothesis and degrees of freedom and a bunch of other things nobody but an actuary remembers. So I was looking for a quick method that was close-enough but didn't require digging out a statistics text-book. I'm sure I've re-invented some well-known measurement, but I'm not sure what it is. The idea is to add up the total steps required to look-up all elements in the hash-table, and compare that to the theoretical ideal of a uniformly balanced hash-table. You can then get a ratio that tells you if you're in the ball-park, or if you should try something else. A diagram should suffice. Scheme for acquiring a hash-utilisation ratio

Scheme for acquiring a hash-utilisation ratio

This seems to give quite useful results with a bare minimum of effort, and most importantly no tricky floating point math. For example, on the standard Unix words with a 2048 entry hash-table, the standard DJB hash came out very well (as expected)

Ideal 2408448
Actual 2473833
----
Ratio 0.973569

To contrast, a simple "add each character" type hash:

Ideal 2408448
Actual 6367489
----
Ratio 0.378241

Example code is hash-ratio.py. I expect this measurement is most useful when you have a largely static bunch of data for which you are attempting to choose an appropriate hash-function. I guess if you are really trying to hash more or less random incoming data and hence only have a random sample to work with, you can't avoid doing the "real" statistics.

12 March 2009

Ian Wienand: Relocation truncated to fit - WTF?

If you code for long enough on x86-64, you'll eventually hit an error such as:

(.text+0x3): relocation truncated to fit: R_X86_64_32S against symbol  array' defined in foo section in ./pcrel8.o

Here's a little example that might help you figure out what you've done wrong. Consider the following code:

$ cat foo.s
.globl foovar
  .section   foo, "aw",@progbits
  .type foovar, @object
  .size foovar, 4
foovar:
   .long 0
.text
.globl _start
 .type function, @function
_start:
  movq $foovar, %rax

In case it's not clear, that would look something like:

int foovar = 0;
void function(void)  
  int *bar = &foovar;

Let's build that code, and see what it looks like

$ gcc -c foo.s
$ objdump --disassemble-all ./foo.o
./foo.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_start>:
   0:		 48 c7 c0 00 00 00 00	mov    $0x0,%rax
Disassembly of section foo:
0000000000000000 <foovar>:
   0:		 00 00			add    %al,(%rax)
   ...

We can see that the mov instruction has only allocated 4 bytes (00 00 00 00) for the linker to put in the address of foovar. If we check the relocations:

$ readelf --relocs ./foo.o
Relocation section '.rela.text' at offset 0x3a0 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000003  00050000000b R_X86_64_32S      0000000000000000 foovar + 0

The R_X86_64_32S relocation is indeed only a 32-bit relocation. Now we can tickle this error. Consider the following linker script, which puts the foo section about 5 gigabytes away from the code.

$ cat test.lds
SECTIONS
 
 . = 10000;
 .text :   *(.text)  
 . = 5368709120;
 .data :   *(.foo)

This now means that we can not fit the address of foovar inside the space allocated by the relocation. When we try it:

$ ld -Ttest.lds ./foo.o
./foo.o: In function  _start':
(.text+0x3): relocation truncated to fit: R_X86_64_32S against symbol  foovar' defined in foo section in ./foo.o

What this means is that the full 64-bit address of foovar, which now lives somewhere above 5 gigabytes, can't be represented within the 32-bit space allocated for it. For code optimisation purposes, the default immediate size to the mov instructions is a 32-bit value. This makes sense because, for the most part, programs can happily live within a 32-bit address space, and people don't do things like keep their data so far away from their code it requires more than a 32-bit address to represent it. Defaulting to using 32-bit immediates therefore cuts the code size considerably, because you don't have to make room for a possible 64-bit immediate for every mov. So, if you want to really move a full 64-bit immediate into a register, you want the movabs instruction. Try it out with the code above - with movabs you should get a R_X86_64_64 relocation and 64-bits worth of room to patch up the address, too. If you're seeing this and you're not hand-coding, you probably want to check out the -mmodel argument to gcc.

2 March 2009

Ian Wienand: YUI ButtonGroup Notes

Some tips and things to check if your YUI ButtonGroup isn't behaving as you wish it would.

Double-check your <body> tag has class="yui-skin-sam"
Unlike in the documentation example, you can't just put a call to YAHOO.widget.ButtonGroup pointing to your div anywhere in your HTML and expect it to work. You've got to wait for it to be ready with something like:
```
<script type="text/javascript">
YAHOO.util.Event.onContentReady("my_button_div", function()  
  var oButtonGroup = new YAHOO.widget.ButtonGroup("my_button_div");
 );
</script>
```

You can easily get an image in each button. For example, if your button is defined as:

 <span id="my-button-id" class="yui-button yui-radio-button yui-button-checked">
  <span class="first-child">
    <button type="button" hidefocus="true"></button>
  </span>
 </span>

Simply add a CSS class something like:

.yui-button#my-button-id button   background:url(http://server/image.jpg) 50% 50% no-repeat;

Hopefully, this will save someone else a few hours!

26 February 2009

Ian Wienand: rdtsc - now even less useful!

An interesting extract from the latest IA32 SDM (18.20.5)

The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at the same, maximum-resolved frequency of the platform, which is equal to the product of scalable bus frequency and maximum resolved bus ratio.

For processors based on Intel Core microarchitecture, the scalable bus frequency is encoded in the bit field MSR_FSB_FREQ[2:0] at (0CDH), see Appendix B, "Model-Specific Registers (MSRs)". The maximum resolved bus ratio can be read from the following bit field:

If XE operation is disabled, the maximum resolved bus ratio can be read in MSR_PLATFORM_ID[12:8]. It corresponds to the maximum qualified frequency.

IF XE operation is enabled, the maximum resolved bus ratio is given in MSR_PERF_STAT[44:40], it corresponds to the maximum XE operation frequency configured by BIOS.

In summary, TSC increment = (scalable bus frequency) * (maximum resolved bus ratio). This implies the TSC is incrementing based on some external bus source (any hardware engineers explain what happened for Core here?), and is a departure from simply assuming that the TSC increments once for each CPU cycle. The interesting bit is that if XE operation is disabled, the bus ratio is assumed to be the maximum qualified frequency. This seems to mean that if you overclock your CPU and your processor is running at higher than the qualified frequency, attempts to measure the CPU speed by counting TSC ticks over a given time may yeild the wrong results (well, will yield the rated result; i.e. the speed of the processor you bought out of the box). While interesting, this divergence is probably has little practical implications because using the TSC for benchmarking is already fraught with danger. You have to be super careful to make sure the compiler and processor don't reschedule things around you and handle other architectural nuances. If you need this level of information, you're much better using the right tools to get it (my favourite is perfmon2).

8 February 2009

Ian Wienand: Converting DICOM images

If you go for an ultrasound or some other imaging procedure, they may give you a CD with the images that requires some overly complicated and under-featured Windows viewer. Chances are these images are in DICOM format, which is like the AVI of the medical world. Your first clue will be that file might report the file as an unoptimised QuickTime movie, e.g.

$ file ./QMAG0001
./QMAG0001: Apple QuickTime movie (unoptimized)

After figuring out the file type wasn't actually anything to do with QuickTime, I tried some of the many different tools and methods to convert this to something viewable. Unfortunatley, the DICOM viewer in GIMP and ImageMagick (probably the same thing?) didn't like the files at all, and neither did a range of other tools. I finally managed to do it with the dcm2pnm tool from the Debian dcmtk package -- just point it at the file and it spits out a PNM which is easily converted by all graphics tools. You can also encapsulate a series of images in a DICOM file, like a little movie. dcm2pnm extracts these easily, but requires the --all-frames options. An ffmpeg recipe to turn these extracted files into a more easily viewable movie is:

$ ffmpeg -qscale 5 -r 20 -b 9600 -i foo.%d.ppm movie.mp4

I certainly can't guarantee this will actually work for you, as DICOM appears to be an extremely complicated format with many possible vendor extensions. But hopefully it's a starting point!

4 February 2009

Ian Wienand: On Complexity

Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.

Alan J. Perlis, Eipgrams on Programming, SIGPLAN Notices Vol. 17, No. 9, September 1982, pages 7-13.

Ian Wienand: NoMachine NX - the missing non-manual

I've been meaning to try NoMachine NX for a while. Its promise of fast remote X11 sessions sounded exactly like what I wanted to log into my work desktop remotely (I really like having a remote desktop with saved state you can just pick up from when using remote access). That was pretty much all I knew about the software, so I was a completely blank slate. The getting started guide is the perfect example of how not to write a getting started guide. Firstly, Section 1 - "Getting started" - gives me a full history of the product, goes into significant depth about the challenges of forwarding X11 requests, talks about the caching and compression implementation, round-trip latency measurement, the details of two-way proxying system and discusses every other feature of the software. My eyes glazed over after about the first paragraph. That's all great -- I just want to know what to do! At this point, I assume that I'm required to run some sort of daemon at the remote end. I download and install the server package (it is explained that the server package requires the client and agent packages as well, fine). I'm paging down, looking for something to get me started. I'm happy to see Section 7 - "Set up your NX Server environment" (remember, at this point I though I needed some daemon running in the background constantly). It even has some commands commands to type, so I tap away, running nxserver --useradd nxtest --system. My server binary doesn't even seem to recognise these options. I give up, assuming that the server isn't running and nothing will work. The getting started guide has abruptly ended and I have no idea what to do. As it turns out, it's all completely trivial. Here's the missing "getting started guide".

Download and install the client, agent and server packages on the remote end. You need to have ssh access to this box.
Install the client on your end.
Run /usr/NX/bin/nxclient. It will start a wizard where you input the remote host name.
The client will, under the hood, ssh to the remote end, open the tunnel it needs, start the server and do all the magic required to make things "just work". A remote desktop will appear.
That's it!

Additional tips:

It's easy to tunnel this connection (for example, if you have to bounce through a ssh gateway to your internal network). Do something like /usr/NX/bin/nxssh -o 'Compression=no' -L 2022:remote.host:22 -f -N user@sshgateway.company.com and then connect the client to localhost:2022. You don't want to compress this link, as NX is already doing it.
The only way I can find to make a new session is to start nxclient with the --wizard command.
Don't click "Disable encryption of all traffic" if you're tunneling. AFAICT this tries to redirect the client to a non-encrypted port, which obviously won't get through.

Other than the documentation, it really works as promised, making remote X11 usable. One really nice feature is that it is smart about the resolution of the remote desktop, filling up your local screen. Add to that you don't need anything setup but your normal ssh connection, and it's a great remote desktop solution.

9 January 2009

Ian Wienand: Facebook, API's, photos and IPTC data

As a photo management application, Facebook sucks. But it is something that people actually look at (as opposed to Flickr, which is great, but getting people to log-in or follow special guest pass links is a PITA). I like to keep all my raw photos locally, using IPTC for comments (which Flickr reads -- I put them in using some custom scripts and the Python bindings of libiptcdata) and geo-tagged in the EXIF data (using my google maps point locator). I figure this way if Flickr goes bust/gets bought by Microsoft all I need to do is re-upload somewhere else. I was waiting for Flickr to integrate with Facebook in some good way, but I then came across the very useful pyfacebook bindings, which, although being a little light on documentation, is a great way to easily throw my photos into Facebook (it's pending the NEW queue in Debian, see #511279). My fbupload.py script might be a useful starting point if you want to do the same thing. It batches up photos into lots of 60 (the maximum photos in an album) and automatically creates the albums and uploads the photos, reading the IPTC data for comments. The only problem is that you'll have to sign up for a developer key and start a new application to get a secret key to talk to the API (if you're still reading this, I'm sure you can figure it out!).

22 December 2008

Ian Wienand: Streaming various radio streams to FStream on the iPhone

FStream is a really neat streaming radio program for the iPhone. Although it supports various WMA streams, I found that it did not successfully work with some of the Australian ABC WMA streaming radio services. The most reliable method seems to simply use a low-bandwidth MP3 stream over HTTP (24 kbps sounds fine and works great even over Edge). I could find a number of other blogs, etc. with static methods for streaming, but nothing that reliably did on-the-fly conversion of an incoming stream. My solution is simple Python HTTP server I'm calling stream2mp3. It uses mplayer, lame and a few pipes to take the incoming stream (which is pretty much anything mplayer can handle, which is pretty much anything unencrypted) and spit it out as a low-bandwidth MP3 stream over HTTP. It seems to reliably handle dropped and closed connections, and clean-up after itself. I'd certainly be interested in any bug fixes or suggestions. I guess the major disadvantages is you need a dedicated server (get yourself a linode!), it only handles one connection at a time, and if you want multiple stations I guess you run multiple instances on different ports. With this, you can be sitting in traffic on the 101 heading to San Francisco and, with some local radio, it's just like you're sitting in traffic on the M2 in Sydney! Here's a screenshot:

~/bin$ python stream2mp3.py
Creating WAV fifo /tmp/incoming.wav
Creating MP3 fifo /tmp/output.mp3
Serving <mms://media3.abc.net.au/702Sydney> on port XXXX
mplayer running as 8524
lame running as 8525
mobile-XXX-XXX-130-107.mycingular.net - - [23/Dec/2008 18:59:22] "GET / HTTP/1.1" 200 -
[radio plays until I stop it...]
connection lost
cleanup complete, ready

25 November 2008

Ian Wienand: Position Independent Code and x86-64 libraries

If you've ever tried to link non-position independent code into a shared library on x86-64, you should have seen a fairly cryptic error about invalid relocations and missing symbols. Hopefully this will clear it up a little! Let's start with a small program to illustrate.

$ cat function.c
int global = 100;
int function(int i)  
	return i + global;
 
$ gcc -c function.c

Firstly, inspect the disassembley of this function:

0000000000000000 <function>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 05 00 00 00 00    	mov    0x0(%rip),%eax        # d <function+0xd>
   d:	03 45 fc             	add    -0x4(%rbp),%eax
  10:	c9                   	leaveq
  11:	c3                   	retq

Lets just go through that for clarity:

0,1: save rbp to the stack and save the stack pointer (rsp) to rbp. This common stanza is setting up the frame pointer, which is essentially a rule used by debuggers (mostly) to keep track of the base of the stack. It's not important for now.
4:Move the value from edi to 4 bytes below the stack pointer. This is moving the first argument (int i) into the "red-zone", a 128-byte scratch area each function has reserved below the stack pointer.
7,d: Move the value at offset 0 from the current instruction pointer (rip) into eax (i.e. the return value). Then add the incoming argument to it (retrieved from the scratch area).

The IP relative move is really the trick here. We know from the code that it has to move the value of the global variable here. The zero value is simply a place holder - the compiler currently does not determine the address (i.e. how far away from the instruction pointer the memory holding the global variable is. It leaves behind a relocation -- a note that says to the linker "you should determine the correct address of foo, and then patch this address of the code to point to that addresss (i.e. foo)." Relocations with addend

The image above gives some idea of how it works. We can examine relocations with the readelf tool.

$ readelf --relocs ./function.o
Relocation section '.rela.text' at offset 0x518 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000009  000800000002 R_X86_64_PC32     0000000000000000 global + fffffffffffffffc

If you try and build a shared object (dynamic library) with this function, you should get something like

$ gcc -shared function.c
/usr/bin/ld: /tmp/ccQ2ttcT.o: relocation R_X86_64_32 against  a local symbol' can not be used when making a shared object; recompile with -fPIC
/tmp/ccQ2ttcT.o: could not read symbols: Bad value
collect2: ld returned 1 exit status

Position Independent Code (PIC, enabled with -fPIC) just means that the output binary does not expect to be loaded at a particular base address, but is happy being put anywhere in memory (compare the output of readelf --segments on a binary such as /bin/ls to that of any shared library). This is obviously critical for implementing shared-libraries, where you may have many, many libraries loaded in essentially any order, and trying to pre-allocate where in memory they would all live just does not work. However, PIC has slightly wider implications than simply movable base addresses. So, back to relocations. The exact rules for different relocation types are described in the ABI for the architecture. The R_X86_64_PC32 relocation is defined as "the base of the section the symbol is within, plus the symbol value, plus the addend". The addend makes it look more tricky than it is; remember that when an instruction is executing the instruction pointer points to the next instruction to be executed. Therefore, to correctly find the data relative to the instruction pointer, we need to subtract the extra. This can be seen more clearly when layed out in a linear fashion (as in the bottom of the above diagram). What's the problem with this in a shared library? In a shared library situation, we can not depend on the local value of global actually being the one we want. Consider the following example, where we override the value of global with a LD_PRELOAD library.

$ cat function.c
int global = 100;
int function(int i)  
	return i + global;
 
$ gcc -fPIC -shared -o libfunction.so function.c
$ cat preload.c
int global = 200;
$ gcc -shared preload.c -o libpreload.so
$ cat program.c
#include <stdio.h>
int function(int i);
int main(void)  
   printf("%d\n", function(10));
 
$ gcc -L. -lfunction program.c -o program
$ LD_LIBRARY_PATH=. ./program
110
$ LD_PRELOAD=libpreload.so LD_LIBRARY_PATH=. ./program
210

If the code in libfunction.so has a fixed offset into its own data section, it will not be able to see the overridden value provided by libpreload.so. This is not the case when building a stand-alone executable, where references are satisfied internally. Of course, any problem in computer science can be solved with a layer of abstraction, and that is what is done when compiling with -fPIC. To examine this case, let's see what happens with PIC turned on.

$ gcc -fPIC -shared -c  function.c
$ objdump --disassemble ./function.o
./function.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <function>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # e <function+0xe>
   e:	8b 00                	mov    (%rax),%eax
  10:	03 45 fc             	add    -0x4(%rbp),%eax
  13:	c9                   	leaveq
  14:	c3                   	retq

It's almost the same! We setup the frame pointer with the first two instructions as before. We push the first argument into memory in the pre-allocated "red-zone" as before. Then, however, we do an IP relative load of an address into rax. Next we de-reference this into eax (e.g. eax = *rax in C) before adding the incoming argument to it and returning.

$ readelf --relocs ./function.o
Relocation section '.rela.text' at offset 0x550 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000a  000800000009 R_X86_64_GOTPCREL 0000000000000000 global + fffffffffffffffc

The magic here is again in the relocations. Notice this time we have a P_X86_64_GOTPCREL relocation. This says "replace the data at offset 0xa with the global offset table (GOT) entry of global. Global Offset Table operation with data variables

Global Offset Table operation with data variables

As shown above, the GOT ensures the abstraction required so symbols can be diverted as expected. Each entry is essentially a pointer to the real data (hence the extra dereference in the code above). Since the GOT is at a fixed offset from the program code, it can use an IP relative address to gain access to the table entries. This extra reference is obviously slower; however for the most part I imagine the overhead would be essentially immeasurable and is required for "generic" operation. If you have figured the cost of indirection through the GOT is the major bottleneck of your program, I imagine you wouldn't be reading this and would already be considering strategies to remove it! The next question is why this works on plain old x86-32. Inspecting the code reveals why:

$ objdump --disassemble ./function.o
00000000 <function>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	a1 00 00 00 00       	mov    0x0,%eax
   8:	03 45 08             	add    0x8(%ebp),%eax
   b:	5d                   	pop    %ebp
   c:	c3                   	ret
$ readelf --relocs ./function.o
Relocation section '.rel.text' at offset 0x2ec contains 1 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000004  00000701 R_386_32          00000000   global

We start out the same, with the first two instructions setting up the frame pointer. However, next we load a memory value into eax -- as we can see from the relocation information, the address of global. Next we add the incoming argument from the stack (0x8(%ebp)) to the value in this memory location; implicitly dereferencing it. This provides the abstraction we need -- if the relocation makes the patched address at 0x4 the address of the GOT entry, it will be correctly dereferenced. It is the inability of the x86-32 architecture to try and optimise by doing instruction-pointer relative offseting which means it always needs to do slower memory references, which turns out to be just what you want when you're making a shared library! So, the executive summary: the ability of x86-64 to use instruction-pointer relative offsetting to data addresses is a nice optimisation, but in a shared-library situation assumptions about the relative location of data are invalid and can not be used. In this case, access to global data (i.e. anything that might be changed around on you) must go through a layer of abstraction, namely the global offset table.

10 November 2008

Ian Wienand: When craziness wraps around...

Rob Landley writes:

A common trick years ago was to set up your routing tables and then have PID 1 exit so the kernel paniced, because the paniced kernel would continue to route packets with _no_userspace_running_. Darn hard to hack a system like that.

This is such a ridiculously stupid idea I think it has wrapped all the way around to the point where it just grazes "genius"!

7 November 2008

Ian Wienand: Spot the bug!

See if you can spot the bug in this code!

#include <stdio.h>
#include <stdlib.h>
int main(void)
 
	union  
		unsigned char raw[8];
		struct  
			char one;
			int two;
			char three;
			char four;
			char five;
		  formatted __attribute__((packed));
	  test;
	printf("one   : %d\n", (int)&test.formatted.one - (int)&test);
	printf("two   : %d\n", (int)&test.formatted.two - (int)&test);
	printf("three : %d\n", (int)&test.formatted.three - (int)&test);
	printf("four  : %d\n", (int)&test.formatted.four - (int)&test);
	printf("five  : %d\n", (int)&test.formatted.five - (int)&test);
	return 0;

$ gcc -Wall  -o packing packing.c
$ ./packing
one   : 0
two   : 4
three : 8
four  : 9
five  : 10

Here's the relevant bit from the gcc manual:

For an enum, struct or union type, you may specify attributes either between the enum, struct or union tag and the name of the type, or just past the closing curly brace of the definition. The former syntax is preferred.

By getting the packed attribute in the wrong place (if it's not clear, it should be before formatted), it is applied to the variable rather than the type. The compiler (both gcc and icc do this) has already laid out the structure, so misses the packing directive, and unfortunately doesn't warn (that may be a bug?). I can tell you from experience this can be hard to track down!

8 October 2008

Ian Wienand: Why symbol visibility is good

ELF has two related concepts for managing symbols in your programs. The first concept is the symbol binding. Global binding means the symbol is visible outside the file being built; local binding is the opposite and keeps the symbol local only (static) and weak is like global, but suggests that the symbol can be overridden.

$ cat syms.c
static int local(void)    
int global(void)    
int  __attribute__((weak)) weak(void)    
$ gcc -o syms -c syms.c
$ readelf --syms ./syms
Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
...
     5: 00000000     8 FUNC    LOCAL  DEFAULT    1 local
     8: 00000008     8 FUNC    GLOBAL DEFAULT    1 global
     9: 00000010     8 FUNC    WEAK   DEFAULT    1 weak
...

This is all well and good, but starts breaking down when you want to load many different modules and keep strict API's (such as, say, dynamic libraries!). Consider that for two files to share a common function, the function must end up with a global visibility.

$ cat file1.c
void common_but_not_part_of_api(void)    
$ cat file2.c
extern void common_but_not_part_of_api(void);
void api_function(void)  
     common_but_not_part_of_api();
 
$ gcc -shared -fPIC  -o library file1.c file2.c
$ readelf --syms ./library
Symbol table '.symtab' contains 60 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
...
    53: 00000424    29 FUNC    GLOBAL DEFAULT   11 api_function
    55: 0000041c     5 FUNC    GLOBAL DEFAULT   11 common_but_not_part_of_api
...

In the example above, both the function we want exported (api_function) and the function we don't wish exported (common_but_not_part_of_api) end up with exactly the same attributes. Binding attributes are useful for the linker putting together object files; but aren't a complete solution. To combat this, ELF provides for visibility attributes. Symbols can be default, protected, hidden or internal. Using these attributes, we can flag extra information for the dynamic loader so it can know which symbols are for public consumption, and which are for internal use only. The most logical way to use this is to make all symbols by default hidden with -fvisibility=hidden and then "punch holes in the wall" for those symbols you want visible.

$ cat file1.c
void common_but_not_part_of_api(void)    
$ cat file2.c
extern void common_but_not_part_of_api(void);
void  __attribute__((visibility("default"))) api_function(void)  
      common_but_not_part_of_api();
 
$ gcc -fvisibility=hidden -shared -fPIC  -o library file1.c file2.c
$ readelf --syms ./library
Symbol table '.symtab' contains 60 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
    48: 000003cc     5 FUNC    LOCAL  HIDDEN   11 common_but_not_part_of_api
    54: 000003d4    29 FUNC    GLOBAL DEFAULT  11 api_function

Now the dynamic loader has enough information to distinguish between the two, and can stop any external access to common_but_not_part_of_api easily. This extra information also has potential for performance improvements. Any time a symbol may be overridden, the compiler must generate a program lookup table (PLT) entry for the function so that the dynamic loader can re-direct the function call. The PLT is a trampoline which gets the correct address of the function being called (from the global offset table, GOT) and bounces the function call to the right place. An example should illustrate: Bouncing via the PLT

In the first example, there was not enough information to tell if the function would ever be able to be overridden, hence a PLT entry had to be created and the function called through it (disassemble it to see the details!). With correct symbol visibility attributes, there is enough information to know that common_but_not_part_of_api is never to be overridden, hence the PLT (and the associated costs of trampolining) can be avoided. The internal attribute is even stricter; it says that this function will never be called from outside this module (for example, we might pass the address of common_but_not_part_of_api to anyone). This can lead to even better code, because on many architectures transitioning to another module might involve flipping global pointer registers or other similarly expensive operations. So that's how symbol binding and visibility attributes can work together to get you the best performance possible from your program!

3 October 2008

Ian Wienand: Poking around in AUXV, part 2

I've written about AUXV previously, focusing on one of its most interesting applications -- its role in helping find linux-gate.so.1. If you're starting your program, you can get the dynamic loader to echo out the AUXV fields with the environment variable LD_SHOW_AUXV, but if your process has started you'll need to pull the values out of /proc/pid/auxv directly. This is pretty internal stuff for the dynamic loader and is probably only useful if you're writing a debugger or doing some other low-level tricks (such as debugging!). However, should you need to, here is some sample code which does just that. Hopefully it will save someone else some time!

29 September 2008

Ian Wienand: Kernel Development Course

Here are some slides and examples I used for a kernel course I developed (some time ago now). The course was aimed at C developers who wanted an introduction to both general UNIX-style user-space and Linux kernel development with a focus on embedded systems issues. The course is aimed at two 8-hour days, and is pretty packed in even then. The first day is user-space development and kernel building, focusing on things like make, autotools, advanced gcc, getting cross-compilers working, configuring the kernel and building. The second day we get into kernel internals; building up a kernel module to produce some simple proc nodes, take data, crash and debug, etc, look at internals like concurrency and the driver model, and focus on USB quite a bit. Here is a tarball of the entire thing, including the examples. Hopefully these can help out anyone tackling the design of such a course.

Day 1
- Linux philosophy, people and processes
- UNIX build environment
  - Makefiles, Autotools
- Executable and Linker Format
- Advanced GCC
- Linker, Dynamic Linker, Shared Libraries
- Cross compilation toolchain
  - Building toolchains on Debian
- Kernel development process
- Source control, patches, GIT, Quilt
- Configuring and building a kernel
- Cross compiling kernel
- initrd, initramfs, init tools

Day 2
- Kernel source layout
- Kbuild, Kconfig
- Syscalls (x86-32)
- Kernel view of userspace
- Virtual Memory
- Getting data in/out kernel
- Crashing, Oops, backtraces
- Various tools
  - systemtap, kprobes
  - perfmon, oprofile
  - valgrind
- Basic constructs
  - Concurrency, locks
  - Interrupts, sleeping
  - Atomic access
  - Memory allocation
- VFS layer
- /proc
- Char and block devices
- Driver model
  - sysfs, kobjects
  - buses, devices
- USB Overview
  - Protocol overview
  - Registering drivers
  - uevents
- Useful development paradigms
- Unit testing

23 August 2008

Ian Wienand: Spice hacking

Sick of a ridiculous pile of tiny bottles falling everywhere in your cupboard?

Many people will sell you many expensive solutions. However, I think we've hit on something cheap, effective and practical; zip-lock bags in a deep tub, labelled with sticky tags.

Easy to find.
Easy to get any sized spoon into the bags, or for tipping if you just open a corner.
For freshness, you can vaccuum seal with a straw a-la Alton Brown.
Buy the cheap spices in bulk and pay significantly less.
The more expensive bags with two rows of sealing are a little easier to close, if they come in a good size for your container.
Label with the date if you know it.

Next.

Previous.